============================================================================
The Motorola 68060 Microprocessor
Joe Circello and Floyd Goodrich
Microprocessor and Memory Technology Group
Motorola Inc.
Abstract
The Motorola 68060 is the fourth generation microprocessor of the
M68000 Family. User object code compatible with previous family members,
it delivers 3 to 3.5 times the performance of the previous generation
processor in this family, the 68040. Performance features include a
superscalar integer unit, a high-performance floating point unit, dual
8 Kbyte on-chip caches, a branch cache, and on-chip memory management units.
A streamlined design enables high-performance techniques to achieve a high
level of parallel instruction execution. Improved performance at a low
cost makes the 68060 an ideal processor for the mid to high range of
desktop computing applications, and compatibility features enable it to
easily upgrade performance of existing 68040-based systems. This paper
describes the operation of the 68060.
Figure 1 - Simplified 68060 Block Diagram (***NOT SHOWN IN ASCII-ONLY COPY***)
Introduction and Overview
The 68060 is the fourth generation microprocessor of Motorola's
M68000 Family of CISC microprocessors. It is a single-chip implementation
that employs a deep pipeline, dual-issue superscalar execution, a branch
cache, a high performance floating point unit (FPU), 8 Kbytes each of
on-chip instruction and data caches, and on-chip demand paged memory
management units (MMUs). These features allow it to sustain execution
rates of less than one clock per instruction.
In order to meet the performance goals of the 68060, instruction
execution times needed to decrease, and parallel operations needed to
increase over previous generations of M68000 microprocessors. A superscalar
instruction dispatch micro-architecture is the most obvious feature of this
increased parallelism on the 68060. Superscalar architectures are
distinguished by their ability to dispatch two or more instructions per clock
cycle from an otherwise conventional instruction stream.
Figure 1 shows a block diagram of the 68060. In addition to the
superscalar features, this single chip has many other performance, upgrade
and system integration features including:
* 100% user-mode object code compatibility with 68040
* Dual-issue superscalar instruction dispatch implementation of
M68000 architecture
* IEEE Compatible on-chip FPU
* Branch Cache to minimize pipeline refill latency
* Separate 8 Kbyte on-chip instruction and data caches with
simultaneous access
* Bus Snooping
* 68040 compatible bus protocol or new high-speed bus protocol
* 32-bit nonmultiplexed address and data bus
* Four-entry write buffer
* Concurrent operation of Integer Unit, FPU, MMUs, caches,
Bus Controller, and Pipeline
* Sophisticated power management subsystem
* Low-power 3.3V operation
* JTAG Boundary Scan
Design Targets
The design goals of the 68060 included providing a simple upgrade
path for existing M68000 Family designs while also supplying a basis for
Motorola's successful 68EC0x0 Family of embedded controllers and for the
68300 Family of modular integrated controllers.
Initial requirements for the 68060 were to provide a factor of three
performance enhancement over a 25 MHz 68040 with existing compiler
technology. Architectural enhancements were to provide at least a 50%
improvement, with the doubled clock frequency supplying the remaining
factor of two (1.5 x 2 = 3). The
performance estimates reflect analysis of existing object code; additional
performance advantages are, of course, available when using compilers
designed specifically for the 68060.
In addition to software compatibility, the 68060 preserves the
investment in board-level ASICs by providing bus compatibility with the
68040. This supersocket approach facilitates upgrade of all existing and
future 68040-based systems.
The 68060 uses approximately 2.4 million transistors. The part is
a static CMOS design based on a 0.5 um, triple-level-metal wafer process.
This process enables the 68060 to operate from a 3.3 volt power supply--
a greater than 50% power reduction compared to a 5.0 volt supply. Since the
68060 minimizes power dissipation through a variety of architectural and
circuit techniques, it is able to offer high performance processing to the
laptop and portable markets in addition to the traditional computer-system
markets.
Architectural Features
The architecture of the 68060 revolves around its novel integer unit
pipeline. By adopting many of the performance enhancements used in RISC
designs and developing new architectural techniques of its own, the 68060
brings new levels of performance to the M68000 Family.
The superscalar micro-architecture actually consists of two distinct
parts: a four-stage instruction fetch pipeline (IFP) responsible for
accessing the instruction stream and dual four-stage operand execution
pipelines (OEPs) which perform the actual instruction execution. These
pipeline structures operate in an independent manner with a FIFO instruction
buffer providing the decoupling mechanism. A branch cache minimizes the
latency effects of change of flow instructions by allowing the IFP to
detect changes in the instruction prefetch stream well in advance of
their actual execution by the OEPs.
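As a rough, hedged illustration of this decoupling (a sketch, not
Motorola's implementation), the following C fragment models an
instruction-fetch producer and a dual-issue consumer separated by a FIFO
buffer; the buffer size, fetch rate, and instruction encoding are invented
for the example.

    /* Decoupled fetch/execute sketch: the fetch side fills a FIFO, the
     * execute side drains up to two entries per clock (dual issue), and
     * either side can stall without immediately stalling the other. */
    #include <stdio.h>

    #define FIFO_SLOTS 16                    /* hypothetical entry count */

    static int fifo[FIFO_SLOTS];
    static int head, tail, count;

    static int fifo_put(int op)              /* IFP side: enqueue an op */
    {
        if (count == FIFO_SLOTS) return 0;   /* buffer full: fetch stalls */
        fifo[tail] = op;
        tail = (tail + 1) % FIFO_SLOTS;
        count++;
        return 1;
    }

    static int fifo_get(int *op)             /* OEP side: dequeue an op */
    {
        if (count == 0) return 0;            /* buffer empty: issue stalls */
        *op = fifo[head];
        head = (head + 1) % FIFO_SLOTS;
        count--;
        return 1;
    }

    int main(void)
    {
        int next_op = 0, done = 0;
        for (int clock = 0; clock < 8; clock++) {
            for (int i = 0; i < 2 && next_op < 10; i++)   /* fetch side */
                if (fifo_put(next_op)) next_op++;
            for (int i = 0; i < 2; i++) {                 /* execute side */
                int op;
                if (fifo_get(&op)) done++;
            }
            printf("clock %d: fetched %d, executed %d, buffered %d\n",
                   clock, next_op, done, count);
        }
        return 0;
    }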
The 68060 is a full internal Harvard architecture. The instruction
and data caches are designed to support concurrent instruction fetch
and operand read and operand write references on every clock cycle. This
organization, coupled with a multi-ported register file, provides the
necessary bandwidth to maximize the throughput of the pipelines. The
operand execution pipelines operate in a lock-stepped manner that provides
simultaneous, but not out-of-order, program execution. The net result is
a machine architecture that is invisible to existing applications while
providing full support of the M68000 programming model, including
precise exceptions.
The 68060 external bus interface provides a superset of 68040
functionality. Maintaining 32-bit widths on both the address and data
bus as well as a bursting protocol for cacheable memory, the 68060 supports
transfers of one, two, four, or 16 bytes in a given bus cycle. The system
designer can, however, choose to operate in one of two modes: a mode
compatible with the 68040 protocol or a new mode consistent with higher
frequency bus designs. By allowing this choice, the 68060 can easily fit
into upgrades of existing designs as well as new high frequency
implementations.
Pipeline Organization
The IFP is responsible for prefetching instructions and loading them
into the FIFO instruction buffer. One key aspect of the design is the branch
cache, which allows the IFP to detect changes in the instruction stream
based on past execution history. This allows the IFP to provide a
constant stream of instructions to the instruction buffer to maximize
the execution rates of the OEPs. The IFP is implemented as a four-stage
design shown in Figure 2.
Figure 2 - The IFP of the 68060 (***NOT SHOWN IN ASCII-ONLY COPY***)
The four stages of the IFP are: Instruction Address Generation (IAG),
Instruction Cache (IC), Instruction Early Decode (IED) and Instruction
Buffer (IB). The instruction and branch caches are integral components of
the IFP.
Four operations can occur concurrently in the IFP. The
IAG stage calculates the next prefetch address from a number of possible
sources. The variable length of the M68000 Family instruction set as well
as change-of-flow detection make this stage critical to the performance of
the 68060. After the IAG sends the appropriate address to the instruction
cache, the IC stage of the IFP is responsible for performing the cache lookup
and fetching the bit pattern of the instruction. The IED stage of the
pipeline analyzes the bytes fetched from the instruction stream and builds
an extended operation word. This stage effectively converts the
variable-length, multi-format instruction into a fixed-length extended
operation word that the OEPs use in all subsequent processing. At the
conclusion of the IED stage, the prefetched bytes, along with the
extended operation word, issue into the instruction buffer. The IB stage
reads instructions from the 96-byte FIFO buffer and loads them into
the dual OEPs. The FIFO effectively decouples the operation of the IFP from
the operations of the dual OEPs.
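To make the IED stage's role concrete, here is a hedged sketch of
pre-decoding a variable-length instruction into a fixed-size record; the
field layout is invented for illustration and is not the 68060's actual
extended operation word format.

    #include <stdint.h>
    #include <stdio.h>

    struct ext_op {                  /* fixed-length pre-decoded form */
        uint16_t opword;             /* first 16-bit operation word */
        uint16_t ext[4];             /* extension words, zero-padded */
        uint8_t  length;             /* instruction length in 16-bit words */
    };

    /* Pre-decode 'n' words starting at 'stream' into a fixed-size record. */
    static struct ext_op predecode(const uint16_t *stream, int n)
    {
        struct ext_op e = {0};
        e.opword = stream[0];
        e.length = (uint8_t)n;
        for (int i = 1; i < n && i <= 4; i++)
            e.ext[i - 1] = stream[i];
        return e;
    }

    int main(void)
    {
        /* A one-word and a three-word instruction from the same stream
         * (sample bit patterns only). */
        uint16_t stream[] = { 0x4E71, 0x2039, 0x0001, 0x0000 };
        struct ext_op a = predecode(&stream[0], 1);
        struct ext_op b = predecode(&stream[1], 3);
        printf("a: opword %04X, length %u words\n", a.opword, (unsigned)a.length);
        printf("b: opword %04X, length %u words\n", b.opword, (unsigned)b.length);
        return 0;
    }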
Figure 3 - The OEP Units of the 68060 (***NOT SHOWN IN ASCII-ONLY COPY***)
Consecutive instructions issue from the FIFO instruction buffer into
the instruction registers of the dual OEPs. The operand execution pipelines,
known as the primary OEP (pOEP) and the secondary OEP (sOEP), are partitioned
into a 4-stage implementation depicted in Figure 3. The four stages of the
OEPs are: Decode and Select (DS), operand Address Generation (AG), Operand
Cycle (OC) and the EXecute cycle (EX). For instructions writing data to
memory, there are two additional pipeline stages: the Data Available (DA)
and Store (ST) cycles.
The Decode and Select stage of the OEPs provides two primary
functions: this stage determines the next state for the entire operand
pipeline and selects the components required for operand address calculation.
To determine the next state of the OEPs, the DS cycle logic tests the
extended operation words to ascertain the number of instructions that can
issue into the AG stage. If multiple instructions can issue into the AG
stages in parallel, the first and second instructions move into the
respective AG stages. If only a single instruction can issue because of
architectural constraints, the first instruction issues into the pOEP, and
the DS stage evaluates the second and third instructions as a pair during
the next clock cycle. The net effect is a sliding 2-instruction window to
examine possible pairs of instructions for parallel execution. A dedicated
adder located in the AG stage sums the three components of the effective
address: the base, the index and the displacement.
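As a minimal sketch of the AG-stage calculation just described, the
effective address is the sum of the base, index, and displacement
components; index scaling and the many M68000 addressing-mode variations
are omitted here.

    #include <stdint.h>
    #include <stdio.h>

    /* Effective address = base + index + displacement. */
    static uint32_t effective_address(uint32_t base, uint32_t index,
                                      int32_t displacement)
    {
        return base + index + (uint32_t)displacement;
    }

    int main(void)
    {
        /* e.g. an access like 8(A0,D1) with A0 = 0x1000 and D1 = 0x20 */
        printf("EA = 0x%08lX\n",
               (unsigned long)effective_address(0x1000u, 0x20u, 8));
        return 0;
    }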
The Operand Cycle (OC) of the OEPs performs the actual fetch of
operands required by the instruction. For memory operands, the OEP accesses
the data cache in this cycle to retrieve the data. For register operands,
the OEP accesses the register file containing all the general-purpose
registers during the OC stage. At the conclusion of the OC cycle, the execute
engines receive the required operands. The EXecute cycle (EX) performs the
operations required to complete the instruction execution including updating
the condition codes. If the destination of the instruction is a data or an
address register, the result is available at the end of the EX stage; if
the destination is a memory location, the operation requires two additional
cycles. First, there is a Data Available (DA) stage where the destination
operand issues to the data cache, which aligns the operand. Second, updates
to the data cache occur during the STore (ST) cycle. Additionally, there is
a four-longword FIFO write buffer that is selectable on a page basis and
serves to decouple the operation of the OEPs from external bus cycles.
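The sketch below, using an invented entry format, illustrates the role of
such a four-entry write buffer: stores queue here and drain to the bus
later, so the store path stalls only when all four entries are occupied.

    #include <stdint.h>
    #include <stdio.h>

    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr, data; };

    static struct wb_entry wb[WB_ENTRIES];
    static int wb_head, wb_tail, wb_count;

    static int wb_push(uint32_t addr, uint32_t data)   /* ST-stage side */
    {
        if (wb_count == WB_ENTRIES) return 0;          /* full: must stall */
        wb[wb_tail].addr = addr;
        wb[wb_tail].data = data;
        wb_tail = (wb_tail + 1) % WB_ENTRIES;
        wb_count++;
        return 1;
    }

    static int wb_drain(void)                          /* external bus side */
    {
        if (wb_count == 0) return 0;
        printf("bus write: [%08lX] <- %08lX\n",
               (unsigned long)wb[wb_head].addr, (unsigned long)wb[wb_head].data);
        wb_head = (wb_head + 1) % WB_ENTRIES;
        wb_count--;
        return 1;
    }

    int main(void)
    {
        for (uint32_t i = 0; i < 6; i++)
            while (!wb_push(0x2000u + 4 * i, i))       /* stall until room */
                wb_drain();
        while (wb_drain())                             /* flush what remains */
            ;
        return 0;
    }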
Since this is an order-two superscalar machine (dual instruction
issue), the sOEP is conceptually a copy of the pOEP. A notable exception to
this concept is the fact that the sOEP executes only a subset of the complete
instruction set. As an example, the floating point execute engine resides
only in the pOEP. Consequently, all floating point instructions must
execute only in the pOEP. As instructions travel down the OEPs, they remain
lock-stepped. This ensures that there is no out-of-order execution and thus
greatly simplifies support for the precise exception model of the M68000
Family. The micro-architecture of the 68060 includes a number of
optimizations that increase the rate of superscalar instruction dispatch.
In internal
evaluations of traces from existing object code totaling several billion
instructions, 50% to 60% of instructions execute as pairs.
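A hedged sketch of this pairing decision is shown below. The only rule
encoded is the one stated above (floating point executes only in the
pOEP), plus a hypothetical "pOEP-only" class standing in for the other
instructions outside the sOEP subset; multi-cycle execution timing is
ignored.

    #include <stdio.h>

    enum op_class { OP_INT, OP_FP, OP_POEP_ONLY };

    /* Can 'second' issue into the sOEP alongside 'first' in the pOEP? */
    static int can_pair(enum op_class first, enum op_class second)
    {
        (void)first;
        if (second == OP_FP) return 0;          /* FP must use the pOEP */
        if (second == OP_POEP_ONLY) return 0;   /* outside the sOEP subset */
        return 1;
    }

    int main(void)
    {
        enum op_class stream[] = { OP_INT, OP_FP, OP_INT, OP_INT,
                                   OP_POEP_ONLY };
        int n = (int)(sizeof stream / sizeof stream[0]);
        int i = 0, cycles = 0;
        while (i < n) {                  /* sliding 2-instruction window */
            if (i + 1 < n && can_pair(stream[i], stream[i + 1]))
                i += 2;                  /* both issue this cycle */
            else
                i += 1;                  /* single issue */
            cycles++;
        }
        printf("%d instructions issued in %d cycles\n", n, cycles);
        return 0;
    }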
From the preceding discussion concerning the operand pipeline stages,
all data cache read references occur in the OC stage while data cache write
references occur in the ST stage. The data cache uses a 4-way interleaving
scheme to allow simultaneous operand read and write operations from both
OEPs. The data cache directories are a single-ported design. As a result,
within a superscalar pair of instructions, the 68060 only allows a single
operand memory reference. The data cache also supports single-cycle
references of 64-bit double-precision floating-point operands.
A common drawback to long pipelines is the penalty associated with
refilling the pipeline when a change of program flow occurs. Condition code
evaluation occurs in the EX stage, but waiting for a branch instruction to
reach this point needlessly restricts performance. Instead, the 68060
contains a 256-entry Branch Cache (BC) which predicts the direction of a
branch based on past execution history well in advance of the actual
evaluation of condition codes.
The BC stores the Program Counter value of change-of-flow instructions
as well as the target address of those branches. The BC also uses some
history bits to track how each given branch instruction has executed in the
past. The 68060 checks the BC during the IC stage of the IFP, the same stage
that performs the lookup into the Instruction Cache. If the BC indicates that
the instruction is a branch and that this branch should be predicted as taken,
the IAG pipeline stage is updated with the target address of the branch
instead of the next sequential address. This approach, along with the
instruction folding techniques that the BC uses, allows the 68060 to achieve a
zero-clock latency penalty for correctly predicted taken branches.
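The paper specifies only that each BC entry holds the branch's program
counter, its target, and some history bits; the direct-mapped organization
and two-bit saturating counter in the following sketch are common textbook
choices assumed for illustration, not the 68060's documented design.

    #include <stdint.h>
    #include <stdio.h>

    #define BC_ENTRIES 256

    struct bc_entry {
        uint32_t branch_pc;   /* address of the change-of-flow instruction */
        uint32_t target;      /* predicted target address */
        uint8_t  history;     /* 0..3 saturating counter; >= 2 means taken */
        uint8_t  valid;
    };

    static struct bc_entry bc[BC_ENTRIES];

    /* Consulted during instruction fetch: returns 1 and the target if the
     * fetch address hits an entry currently predicted taken. */
    static int bc_predict(uint32_t pc, uint32_t *target)
    {
        struct bc_entry *e = &bc[(pc >> 1) % BC_ENTRIES];
        if (e->valid && e->branch_pc == pc && e->history >= 2) {
            *target = e->target;
            return 1;
        }
        return 0;
    }

    /* Updated once the branch resolves in the EX stage. */
    static void bc_update(uint32_t pc, uint32_t target, int taken)
    {
        struct bc_entry *e = &bc[(pc >> 1) % BC_ENTRIES];
        if (!e->valid || e->branch_pc != pc) {
            e->valid = 1;
            e->branch_pc = pc;
            e->history = taken ? 2 : 1;
        } else if (taken && e->history < 3) {
            e->history++;
        } else if (!taken && e->history > 0) {
            e->history--;
        }
        e->target = target;
    }

    int main(void)
    {
        uint32_t t = 0;
        bc_update(0x1000, 0x0F00, 1);      /* a loop branch resolves taken */
        printf("predicted taken: %d\n", bc_predict(0x1000, &t));
        return 0;
    }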
If the BC predicts a branch as not-taken, there is no discontinuity
in the instruction prefetch stream. The IFP continues to fetch instructions
sequentially. Eventually, the not-taken branch instruction executes as a
single-clock instruction in the OEP, so correctly predicted not-taken
branches require a single clock to execute. These predicted-as-not-taken
branches also allow a superscalar instruction dispatch, so in many cases, the next
instruction executes simultaneously in the sOEP.
The 68060 performs the actual condition code checking to evaluate the
branch conditions in the EX stage of the OEP. If a branch has been
mispredicted, the 68060 discards the contents of the IFP and the OEPs, and
the 68060 resumes fetching of the instruction stream at the correct location.
To refill the pipeline in this manner, there is a seven-clock penalty for a
mispredicted branch. If the BC correctly predicted the branch, the OEPs
execute seamlessly with no pipeline stalls. Internal studies of the
prediction algorithm used on the 68060 show greater than 90% accuracy,
based on statistics gathered from several billion instructions of
application code across many runtime environments.
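A quick back-of-the-envelope use of these figures, assuming the stated
seven-clock refill penalty and the 90% accuracy lower bound, puts the
average refill cost added per branch below one clock:

    #include <stdio.h>

    int main(void)
    {
        double accuracy = 0.90;            /* stated lower bound */
        double refill_penalty = 7.0;       /* clocks to refill the pipeline */
        double avg = (1.0 - accuracy) * refill_penalty;
        printf("average refill penalty per branch <= %.2f clocks\n", avg);
        return 0;
    }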
Floating Point Unit
The floating point unit (FPU) of the 68060 provides complete binary
compatibility with previous M68000 Family floating point solutions. The
68060 performs all internal operations in 80-bit extended precision and
completely supports the IEEE 754 floating point standard.
Conceptually, the FPU appears as another execute engine in the EX
stage of the pOEP. A 64-bit data path between the data cache and the FPU
optimizes the FPU for single-cycle references of 32- or 64-bit memory
operands. As previously noted, all floating point instructions must execute
through the pOEP. However, integer instructions can be simultaneously
dispatched into the sOEP with most FPU instructions, and the 68060 supports
overlap between the integer execute engines and the FPU. Once a multi-cycle
FPU instruction is dispatched, the pOEP and sOEP continue to dispatch and
complete integer instructions (including change-of-flow instructions) until
another FPU instruction is encountered. At this point, the OEPs stall until
the FPU execute engine is available for the next instruction.
The FPU's internal organization consists of three units: the adder,
the multiplier and the divider. The 68060's design does not support
concurrent floating point execution; only one of these functional units is
active at a time. Table 1 shows execution times for the 68060 FPU.
    Instruction     CPU Clocks
    -----------     ----------
    FMOVE                1
    FADD                 3
    FMUL                 4
    FDIV                24
    FSQRT               66

Table 1 - 68060 Floating Point Execution Times
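Because the adder, multiplier and divider do not operate concurrently,
back-to-back floating point instructions occupy the FPU for at least the
sum of their Table 1 latencies; the small sketch below applies the table
to an fmul followed by a dependent fadd (integer overlap and dispatch
effects are ignored).

    #include <stdio.h>

    enum fpu_op { FMOVE, FADD, FMUL, FDIV, FSQRT };

    static const int fpu_clocks[] = { 1, 3, 4, 24, 66 };    /* Table 1 */

    int main(void)
    {
        enum fpu_op seq[] = { FMUL, FADD };   /* as in the loop example below */
        int total = 0;
        for (unsigned i = 0; i < sizeof seq / sizeof seq[0]; i++)
            total += fpu_clocks[seq[i]];
        printf("FPU busy for %d clocks\n", total);          /* 4 + 3 = 7 */
        return 0;
    }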
Pipeline Example
Figure 4 shows an example of the 68060 pipeline operation. The code
shown comes from a commercially available compiler and represents the
inner SAXPY loop from the matrix300 program from the SPEC89 benchmark suite.
Since the OEPs are decoupled from the IFP, this example only focuses on
the OEPs.
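For reference, a generic C form of the SAXPY inner loop follows
(y = a*x + y). This is not the compiler output analyzed in the paper, only
the computation that the 13-instruction loop performs.

    #include <stdio.h>

    /* One fmul and one fadd (plus loads, a store and loop control)
     * per element. */
    static void saxpy(int n, double a, const double x[], double y[])
    {
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }

    int main(void)
    {
        double x[3] = { 1.0, 2.0, 3.0 };
        double y[3] = { 4.0, 5.0, 6.0 };
        saxpy(3, 2.0, x, y);
        printf("%g %g %g\n", y[0], y[1], y[2]);    /* 6 9 12 */
        return 0;
    }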
This loop executes 13 instructions in only ten clock cycles,
producing a steady-state performance of 0.77 clocks per instruction (CPI).
This code includes two multi-cycle FPU instructions (4-cycle FMUL and 3-cycle
FADD), but the superscalar micro-architecture is able to effectively exploit
the parallelism within the loop to achieve a less than one CPI measure.
This example code loop demonstrates several major architectural
features of the 68060. Of the 13 instructions, the 68060 dispatches four
groups of 2-instruction pairs (at cycles 1, 2, 4, 5), one group of three
instructions (at cycle 9) and two individual instructions (at cycles 3 and 8).
At cycle 3, the pair of instructions being examined is {pOEP = lsl.l,
sOEP = fadd.d}. Since all floating-point instructions must issue into the
pOEP, the fadd.d does not issue into the sOEP. On the next cycle, a new
2-instruction pair is examined {pOEP = fadd.d, sOEP = add.l}, and at this
time, both instructions issue down the OEPs. At cycles 6 and 7, the pipeline
stalls on the fadd.d instruction as the 4-cycle fmul completes execution. The
floating-point store operation at cycle 8 inhibits any sOEP dispatch because
of certain post-exception fault possibilities. At cycle 9, an instruction
triplet is dispatched {add.l, subq.l, bcc.b}. Recall the branch cache
utilizes various instruction folding techniques that effectively allow this
predicted as taken branch to execute in 0 cycles. Finally, at cycle 10, the
pipeline stalls for one clock on the floating-point store instruction as it
waits for the completion of the three-cycle fadd.
Power Management On Chip
With 2.4 million transistors operating at frequencies of 50 MHz and
higher, power management becomes a crucial issue on the 68060. From its
inception, the 68060 design focused on minimizing chip-level power
dissipation, addressing three primary areas: the supply voltage and static
design, dynamic power management circuitry, and the LPSTOP instruction.
The 68060 operates from a 3.3 volt power supply. Since power
dissipation is a function of the square of the power supply voltage, simply
changing the power supply voltage to 3.3 volts results in a 56% reduction
in power compared to a 5 volt power supply. In addition to a lower supply
voltage, the 68060 is a completely static design. The 68060's operating
frequency, which linearly affects chip-level power dissipation, can vary
dynamically down toward the DC range. Although the 68060 is a 3.3 volt part,
its I/O buffers interface to either 3 volt or 5 volt peripherals and memory,
facilitating upgrades of existing designs.
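As a worked form of the square-law claim above: with dynamic power
proportional to the square of the supply voltage (other factors held
equal), moving from 5.0 volts to 3.3 volts leaves (3.3/5.0)^2, or about
44%, of the power, which is the roughly 56% reduction cited.

    #include <stdio.h>

    int main(void)
    {
        double ratio = (3.3 / 5.0) * (3.3 / 5.0);      /* V^2 scaling */
        printf("relative power: %.2f (reduction: %.0f%%)\n",
               ratio, (1.0 - ratio) * 100.0);
        return 0;
    }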
Sophisticated power management circuitry on chip dynamically controls
and minimizes power consumption. This circuitry selectively updates modules
on the 68060 on a clock-by-clock basis, dynamically shutting off the circuits
not required to support the activities in the current clock cycle. Entire
areas of the 68060 can shut off for long periods of time when they are not
required.
The 68060 also incorporates the LPSTOP instruction. This instruction
effectively puts the 68060 into a low-power sleep mode in which it stays until
awakened by an externally generated interrupt. Data on previous members of the
M68000 Family shows that use of the LPSTOP instruction can extend battery
life in portable applications by over 250%.
Summary
The 68060 relies on new as well as standard architectural techniques
to extend the performance of the M68000 Family product line. Performance
simulations predict that between 3 and 3.5 times the performance of a
25 MHz 68040 is possible using existing object code.
The 68060 relies on a deep internal pipeline and a superscalar
internal architecture coupled with 8 Kbyte instruction and data caches, a
256-entry branch cache, on-chip MMUs and an on-chip FPU to bring new levels
of performance to the M68000 Family architecture.
Power management is very important on the 68060, and this design uses
dynamic power management techniques to minimize power consumption. The 68060
operates from a 3.3 volt power supply, which greatly reduces its power
dissipation. Although the 68060 operates at a lower supply voltage, it
interfaces to both 3 volt and 5 volt peripherals and logic.
In addition to providing full application object code compatibility
with previous CPUs in this family, the 68060 provides a superset of 68040
hardware functionality. Designs compatible with existing and future 68040
systems are simple, and higher frequency designs are possible using a new
bus interface protocol.
Acknowledgements
The authors would like to thank all members of the 68060 design team
and management. Without their concerted team effort, this project and this
paper would not have been possible.